Scalable Simple Random Sampling and Stratified Sampling
Abstract
Analyzing data sets of billions of records has become a regular task in many companies and institutions. In the statistical analysis of those massive data sets, sampling generally plays a very important role. In this work, we describe a scalable simple random sampling algorithm, named ScaSRS, which uses probabilistic thresholds to decide on the fly whether to accept, reject, or wait-list an item independently of others. We prove that, with high probability, it succeeds and needs only O(√k) storage, where k is the sample size. ScaSRS extends naturally to a scalable stratified sampling algorithm, which is favorable for heterogeneous data sets. The proposed algorithms, when implemented in MapReduce, can effectively reduce the size of intermediate output and greatly improve load balancing. Empirical evaluation on large-scale data sets clearly demonstrates their superiority.
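The accept / reject / wait-list decision described in the abstract can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the authors' implementation: the function names `scasrs_thresholds` and `scasrs_sample`, the failure parameter `delta`, and in particular the exact threshold formulas are assumptions, since the abstract does not state them. The sketch assumes the population size `n` is known in advance and that each item draws an independent uniform key.

```python
import math
import random


def scasrs_thresholds(n, k, delta=1e-4):
    """Return (low, high) key thresholds around the sampling rate p = k / n.

    ASSUMPTION: these concentration-bound style formulas are one plausible
    choice; the abstract does not give the thresholds.
    """
    p = k / n
    g1 = -math.log(delta) / n
    g2 = -2.0 * math.log(delta) / (3.0 * n)
    high = min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p))
    low = max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p))
    return low, high


def scasrs_sample(items, n, k, delta=1e-4, rng=None):
    """Draw k of n items without replacement in one pass.

    Returns the sample as a list, or None if the thresholds failed
    (too many accepted, or too few candidates), in which case the
    caller can simply retry.
    """
    rng = rng or random.Random()
    low, high = scasrs_thresholds(n, k, delta)
    accepted = []   # keys below `low`: taken immediately
    waitlist = []   # keys in [low, high): decided after the pass
    for item in items:
        u = rng.random()
        if u < low:
            accepted.append(item)
        elif u < high:
            waitlist.append((u, item))
        # keys >= high are rejected on the fly and never stored

    need = k - len(accepted)
    if need < 0 or need > len(waitlist):
        return None  # rare failure case
    # Fill the remaining slots with the smallest wait-listed keys, so the
    # result equals "the k items with the smallest keys", a simple random sample.
    waitlist.sort(key=lambda pair: pair[0])
    return accepted + [item for _, item in waitlist[:need]]
```

Only the wait-list has to be kept and sorted at the end, which is where the storage saving comes from. Under the same assumptions, a stratified variant could call `scasrs_sample` once per stratum with that stratum's own `n` and `k`.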
Similar resources
An Evaluation of Stratified Sampling of Microarchitecture Simulations
Recent research advocates applying sampling to accelerate microarchitecture simulation. Simple random sampling offers accurate performance estimates (with a high quantifiable confidence) by taking a large number (e.g., 10,000) of short performance measurements over the full length of a benchmark. Simple random sampling does not exploit the often repetitive behaviors of benchmarks, collecting ma...
Method of Fuzzy Ratio Estimate
This article develops a method of ratio estimate in fuzzy sense. By both the simple random sampling and stratified random sampling, we can obtain the ratio estimate in usual statistical sense. However, the sampling data may be ambiguous in some uncertain circumstance. To solve such kind of problem, we probe into the simple random sampling and stratified random sampling in fuzzy sense, obtain th...
Sampling Survey of Heavy Metal in Soil Using SSSI
Much attention has been given to sampling design, and the sampling method chosen directly affects the sampling accuracy. The development of spatial sampling theory has led to the recognition of the importance of taking spatial dependency into account when sampling. This text uses the new Sandwich Spatial Sampling and Inference (SSSI) software as a tool to compare the relative error, coefficien...
Comparison of Sampling Techniques on the Performance of Monte-Carlo Based Sensitivity Analysis
Sensitivity analysis is a key part of a comprehensive energy simulation study. Monte-Carlo techniques have been successfully applied to many simulation tools. Several sampling techniques have been proposed in the literature; however to date there has been no comparison of their performance for typical building simulation applications. This paper examines the performance of simple random, strati...
Perfect and Maximum Randomness in Stratified Sampling over Joins
Supporting sampling in the presence of joins is an important problem in data analysis. Pushing down the sampling operator through both sides of the join is inherently challenging due to data skew and correlation issues between output tuples. Joining simple random samples of base relations typically leads to results that are non-random. Current solutions to this problem perform biased sampling o...